Skip to content

Conversation

youkidearitai
Copy link
Contributor

Related: #18467

Maintenance of Oniguruma has ended, so I'll temporarily bundle it.
This pull request is one way of doing it, and this method may not always be adopted.

Maintenance of Oniguruma has ended, so I'll temporarily bundle it
@bukka
Copy link
Member

bukka commented Jul 27, 2025

Considering that Niels already added thumbs down, it needs RFC.

@youkidearitai
Copy link
Contributor Author

I created a Draft RFC.
https://wiki.php.net/rfc/eol-oniguruma

@alexdowad
What do you have any comments about this issue?

@alexdowad
Copy link
Contributor

@youkidearitai Thanks very much for drawing attention to this issue!

Many thanks to @kkos for his work on maintaining Oniguruma over the years. I'm sure it has been challenging.
@kkos, I don't wish to impose on your time, but if you are available to share a bit of your (extensive) knowledge about Oniguruma, it would be much appreciated.

I think our only options are 1) fork Oniguruma and maintain it ourselves, or 2) drop the mb_regex_* functions from mbstring.

Since PHP already has a built-in extension which bundles PCRE2, dropping Oniguruma would mean that PHP only includes one regex engine, rather than two. From a maintenance perspective, that seems desirable.

At the same time, it seems that Oniguruma does have some features which PCRE2 doesn't. From a quick, online search, it appears that OG:

  • Supports EUC-JP, Shift-JIS, and other legacy character encodings directly, while PCRE2 requires UTF-8/16/32.
  • Supports multiple regexp syntaxes, including Ruby, Java, etc
  • Allows multiple named capture groups with the same name
  • Supports "named subexpression call" (\g<name>)
  • ...and may have a couple of other features differing from PCRE2.

Looking at the history of OG, it appears there were a number of CVEs in 2019, but since then there have been few.

Code size: ~75,000 lines of C. About 25,000 are for the regexp engine; about 40,000 are Unicode data tables, and another 10,000 to support various text encodings.

@youkidearitai Do you have any ideas about how to find how many users are relying on mb_regex functions?

@youkidearitai
Copy link
Contributor Author

@alexdowad Thank you very much for investigation!

@youkidearitai Do you have any ideas about how to find how many users are relying on mb_regex functions?

For example, One idea is search to GitHub: https://github.com/search?q=mb_ereg+language%253APHP&type=code&l=PHP
Seems many use cases.

@alexdowad
Copy link
Contributor

alexdowad commented Aug 25, 2025

Just using Claude to analyze the OG repository and figure out where the key parts are and how they work.

Here's what it came up with: https://gist.github.com/alexdowad/b976a927aa908ca6f2bf76afe3d1a302

@alexdowad
Copy link
Contributor

@youkidearitai As a suggestion, instead of re-bundling the Oniguruma source into php-src, what if we forked OG into the php GitHub organization, and maintained the fork ourselves (security fixes only, no new features whatsoever)?
Then we wouldn't have to kill the mb_regex_* functions.
Or do you prefer to get rid of them?

@youkidearitai
Copy link
Contributor Author

@alexdowad Indeed, Forks Oniguruma in PHP organization is one way.
Internals mailing list (and me) likes rid them. But, If use case is many, I reconsider rid it.

@youkidearitai
Copy link
Contributor Author

Oniguruma has strengths for example

$  ~/php85/bin/php -r 'var_dump(mb_ereg("\A\d\z", "1"));'
bool(true)
$ ~/php85/bin/php -r 'var_dump(mb_ereg("\A\d\z", "1"));'
bool(true)
$ ~/php85/bin/php -r 'var_dump(mb_ereg("\A\w\z", "あ"));'
bool(true)
$ ~/php85/bin/php -r 'var_dump(mb_ereg("\A\w\z", "\x0"));'
bool(false)

\d and \w can matches full width characters.
And, mb_ereg functions are many use cases.
Our country, Web security text book uses mb_ereg( "安全なWebアプリケーションの作り方" ISBN: 978-4-7973-9316-3 https://wasbook.org )

It's not easy to get rid of Oniguruma.

@ryangjchandler
Copy link

I know I don't hold any weight in this discussion, but has Onigmo been considered as an alternative to Oniguruma? It's the same engine that Ruby uses and has support for some newer RegEx things (it's an Oniguruma fork).

(I'm the builder and maintainer of https://github.com/phikiphp/phiki which relies on mbregex functions, since TextMate grammars use Oniguruma-flavoured RegEx patterns).

@alexdowad
Copy link
Contributor

@ryangjchandler I don't know why you think that your words don't hold any weight; really, we are always interested in hearing from users.

Thanks for the suggestion about Onigmo.

@alexdowad
Copy link
Contributor

@ryangjchandler From what you know, would moving from Oniguruma to Onigmo be a breaking change in any way?
i.e. does it purely add new features to Oniguruma, or does it change the behavior of existing features as well?

@ryangjchandler
Copy link

@alexdowad To my knowledge it supports the same features as Oniguruma does, plus some extras to get it closer to more recent PCRE releases.

The internals / ABI might be different though - that I can't confirm without looking into it.

@youkidearitai
Copy link
Contributor Author

@ryangjchandler Thank you very much for your feedback!
I confirmed there are many use case in mbregex.

@youkidearitai
Copy link
Contributor Author

Onigumo is an interesting candidate for migration, but its transition stopped at Unicode 11.0 compared to Oniguruma.
However, Ruby uses Onigumo, and it appears they have applied patches internally within Ruby.
This is where we'd like some information from Ruby.
There are other differences, I think.

@RedCMD
Copy link

RedCMD commented Sep 4, 2025

oniguruma also supports the much loved feature variable length lookbehind
Ruby, Onigmo and PCRE don't support it

@ryangjchandler
Copy link

ryangjchandler commented Sep 4, 2025

@RedCMD Oh that's a good one to note then as a breaking change and also a problem for my Phiki package 😅

@bukka
Copy link
Member

bukka commented Sep 4, 2025

I think the important part here is if there are some volunteers to do any work on this which means either migration to Onigma (and possibly trying to backport some features added from Ruby as it doesn't look like it's getting regularly updated) or maintaining the Oniguruma fork (if it's bundled or separate lib does not make huge difference IMHO as one still needs to understand the library to be able to fix the potential security issues)?

The advantage of Onigma would be that it should be enough to track the Ruby version so a deeper understanding might not be necessary but as mentioned some features are missing so it might potentially break code. But it might be in some way like supporting multiple major version of any library where BC gets broken and it's up to the users (distros) to select what they want so maybe that's not such a big issue.

@alexdowad
Copy link
Contributor

@bukka I don't mind taking up the maintenance of a private Oniguruma fork, but with the condition that maintenance will include security fixes only, no new features whatsoever.

@youkidearitai
Copy link
Contributor Author

@alexdowad Fair enough.

If Alex said so, I agree and take up to maintenance Oniguruma. Because there are latent many demands.

@bukka
Copy link
Member

bukka commented Sep 4, 2025

@alexdowad So you would basically prefer library fork in php org that would get its own releases? Just to be clear, this sort of thing might also require keeping build working and possibly also fixing some crashes (even though those might not be necessarily security issues).

I would normally suggest not to use php org but considering the concerns in kkos/oniguruma#234 , it might show clearly that this is the version required for PHP. But we don't have any libraries in PHP org so it would be the first one and we would need to discuss it with admins internally.

We have got actually already for of oniguruma for Windows packaging in https://github.com/winlibs/oniguruma but I guess it's probably better to keep this for Windows packaging only as none of those libs in this org is used for anything else.

@alexdowad
Copy link
Contributor

@bukka Whatever the other core devs prefer is fine. if forking a library into the php org is not favored, I could fork it to my own GH account.

@youkidearitai
Copy link
Contributor Author

FYI: I forked and updated Unicode 17.0 in Oniguruma https://github.com/youkidearitai/oniguruma

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants